
clear all
set more off

***************************************************************************************************
*********************Get test score data***********************************************************
***************************************************************************************************
*Inputs: CST test scores for 2002-03 through 2012-13, then SBAC test scores for 2014-15 through 2016-17 (2013-14 is missing because of the change of test)
*1. CST test files, 2002-03 through 2012-13
*2. SBAC test files, 2014-15 through 2016-17

*Outputs:
*1. "VA_test_scores.dta"
***************************************************************************************************
***************************************************************************************************
***************************************************************************************************

*Starting with the CST test scores which go from 2002-03 through 2012-13.
*This data *reliably* covers: Math for grades 2-6 (in grade 7 students could choose to take CST Math or Algebra test). ELA reliably covers grades 2-8.
*Note that I keep grade 7 math, although it may not be entirely reliable due to Algebra/Math choice

****Step 1. Load CST test scores (dropping unneeded science, social science, and writing results to save space)****
clear all
*School years 2002-03 through 2008-09
foreach n of numlist 2(1)8 {
    local m = `n' + 1
    append using "/data/CST/CST 200`n'-200`m'.dta"
    *Keep only math and ELA
    drop if tstgroupname=="SCIENCE" | tstgroupname=="SOCIAL SCIENCE" | tstgroupname=="WRITING"
}
*School year 2009-10 (the file name crosses the decade, so it is appended outside the loops)
append using "/data/CST/CST 2009-2010.dta"
*Keep only math and ELA
drop if tstgroupname=="SCIENCE" | tstgroupname=="SOCIAL SCIENCE" | tstgroupname=="WRITING"
*School years 2010-11 through 2012-13
foreach n of numlist 0(1)2 {
    local m = `n' + 1
    append using "/data/CST/CST 201`n'-201`m'.dta"
    *Keep only math and ELA
    drop if tstgroupname=="SCIENCE" | tstgroupname=="SOCIAL SCIENCE" | tstgroupname=="WRITING"
}

*Parent education is in demographic data so will drop
drop eduservctrname parentedulevelname parenteducationlvlcode

*Drop unneeded upper-grade math and ELA tests
drop if tsttypename=="ALGEBRA I" | tsttypename=="ALGEBRA II" | tsttypename=="GENERAL MATHEMATICS (GRADES 8" | tsttypename=="GENERAL MATHEMATICS (GRADES 8 & 9)" | tsttypename=="INTEGRATED MATH 1" | tsttypename=="INTEGRATED MATH 2" | tsttypename=="INTEGRATED MATH 3" | tsttypename=="GEOMETRY" | tsttypename=="HS MATHEMATICS"
drop if tsttypename=="ENGLISH LANGUAGE ARTS (GR 11)" | tsttypename=="ENGLISH LANGUAGE ARTS (GR 10)" | tsttypename=="ENGLISH LANGUAGE ARTS (GR 9)"

*Generate the grade of the test from the test type name
gen eng_grade=.
foreach n of numlist 2(1)8 {
    replace eng_grade=`n' if tsttypename=="ENGLISH LANGUAGE ARTS (GR `n')"
}
gen math_grade=.
foreach n of numlist 2(1)7 {
    replace math_grade=`n' if tsttypename=="MATHEMATICS (GRADE `n')"
}

replace castandardprofleveldescr="FAR BELOW BASIC" if castandardprofleveldescr=="FAR BELOW B"

*Drop if missing student id
drop if std_pseudo_id==.

ren cstscaledscoreamt cstscore

*Drop if missing score
replace cstscore = . if cstscore == 0
drop if cstscore==.

*Some students have multiple scores for the same test group and year. The correct score cannot be determined, so all such observations are dropped.
*Very few observations are affected (about 4k of 7.2 million).
duplicates tag std_pseudo_id schendyr tstgroupname, gen(g)
*Missing scores were already dropped above, so this first drop is only a safeguard
drop if g>0 & cstscore==.
drop if g>0
drop g

*Create standardized scores (z-scores within test group, test type, and year)
egen mean = mean(cstscore), by(tstgroupname tsttypename schendyr)
egen sd   = sd(cstscore), by(tstgroupname tsttypename schendyr)
gen  zcst = (cstscore - mean)/sd
drop mean sd
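
*Optional sanity check (an addition, not part of the original pipeline):
*if the standardization above worked, zcst should have mean 0 and a standard
*deviation close to 1 within each test group. This only prints a summary table
*and does not modify the data.
tabstat zcst, by(tstgroupname) statistics(mean sd n) format(%9.4f)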

drop tsttypename
ren castandardprofleveldescr prof_level

*Reshape the data from student-subject-year to student-year
reshape wide cstscore zcst prof_level eng_grade math_grade, i(std_pseudo_id schendyr) j(tstgroupname) string
drop math_gradeELA eng_gradeMATH
*Perfect correlation for grades (if took *both tests* always took math and ELA test in same grade)
corr eng_gradeELA math_gradeMATH

gen testing_grade=math_gradeMATH if cstscoreMATH!=.
replace testing_grade=eng_gradeELA if cstscoreELA!=. & testing_grade==.
drop eng_gradeELA math_gradeMATH

*Rename the variables and save. 
ren cstscoreELA ela_score
ren zcstELA ela_scorez
ren cstscoreMATH math_score
ren zcstMATH math_scorez
ren prof_levelMATH prof_level_math
ren prof_levelELA prof_level_ela

compress
save "/data_analysis/Eliso_Complete/Data/cst.dta", replace





*****Step 2. Get the SBAC test scores*****
clear all

*Load the 2016-17 data
use "/data/SBAC/SBAC 1617.dta"

ren *, lower
ren endyear schoolendyear
ren testexamname testname
drop exclusionflag

*Drop undefined grades
drop if gradecode=="PK" | gradecode=="UNK"
destring gradecode, replace

*Append the 2014-15 and 2015-16 data (both are in one data file)
append using "/data/SBAC/1415-1516 SBAC.dta"

*Drop if invalid test score
drop if testtotalflag=="N"
*Drop unneeded variables
drop testtotalflag testname localdistrictcode parentedulevelname

***Rename to CST var names:
ren schoolendyear schendyr
ren studentpseudoid std_pseudo_id
*Keep only grades 3-8
keep if gradecode>=3 & gradecode<=8
gen eng_grade=gradecode if testgroupname=="ELA"
gen math_grade=gradecode if testgroupname=="Math"
ren performanceleveldescription prof_level
ren testgroupname tstgroupname

*Drop if missing student id
drop if std_pseudo_id==.
*Check for students with multiple scores for the same test and year (there are none)
duplicates report std_pseudo_id schendyr tstgroupname
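
*Optional safeguard (an addition, not in the original script): turn the
*"no duplicates" observation above into a hard check. isid exits with an error
*if student-year-test does not uniquely identify observations, which is the
*same condition the reshape below relies on. missok is an assumption that
*missing key values, if any, should not by themselves trigger the error.
isid std_pseudo_id schendyr tstgroupname, missok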

ren overallscalescore cstscore
replace cstscore = . if cstscore == 0

*Reshape the data from student-subject-year to student-year
reshape wide cstscore prof_level eng_grade math_grade, i(std_pseudo_id schendyr) j(tstgroupname) string
drop eng_gradeMath math_gradeELA

gen testing_grade=math_gradeMath if cstscoreMath!=.
replace testing_grade=eng_gradeELA if cstscoreELA!=. & testing_grade==.

drop math_gradeMath eng_gradeELA gradecode

ren cstscoreELA ela_score
ren cstscoreMath math_score


*Standardize scores within testing grade and year*
egen mean = mean(math_score), by(testing_grade schendyr) 
egen sd = sd(math_score), by(testing_grade schendyr) 
gen  math_scorez = (math_score - mean)/sd
drop mean sd
egen mean = mean(ela_score), by(testing_grade schendyr) 
egen sd = sd(ela_score), by(testing_grade schendyr) 
gen  ela_scorez = (ela_score - mean)/sd
drop mean sd

ren prof_levelMath prof_level_math
ren prof_levelELA prof_level_ela

compress

**Append the CST scores on**
append using "/data_analysis/Eliso_Complete/Data/cst.dta"
erase "/data_analysis/Eliso_Complete/Data/cst.dta"
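
*Optional safeguard (an addition, not in the original script): the lagged
*scores created below assume one observation per student-year. The CST and
*SBAC years do not overlap, so this should hold after the append. isid exits
*with an error if it does not.
isid std_pseudo_id schendyr, missok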

*Rename some variables
ren schendyr year
ren std_pseudo_id stdpseudoid

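*Create lagged math and ELA scores and grade repetition/skipping indicators*
*Lags and indicators are defined only relative to the immediately preceding school year
*The indicators also require a nonmissing testing grade, because Stata evaluates .==. as true
sort stdpseudoid year
bys stdpseudoid (year): gen lag_math=math_scorez[_n-1] if year-1==year[_n-1]
bys stdpseudoid (year): gen lag_ela=ela_scorez[_n-1] if year-1==year[_n-1]
bys stdpseudoid (year): gen repeat_grade=(testing_grade==testing_grade[_n-1] & year-1==year[_n-1] & !missing(testing_grade))
bys stdpseudoid (year): gen skip_grade=(testing_grade-2==testing_grade[_n-1] & year-1==year[_n-1] & !missing(testing_grade))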

*Grade skipping is very rare; will drop
tab skip_grade
drop skip_grade

*Save the output data which contains test scores at the student-year unit of observation
compress
save "/data_analysis/Eliso_Complete/Data/VA_test_scores.dta", replace















